One-Hot Encoding


In [1]:
dd <- read.table(text="
   RACE        AGE.BELOW.21     CLASS
   HISPANIC          0          A
   ASIAN             1          A
   HISPANIC          1          D
   CAUCASIAN         1          B",
  header=TRUE)

In [2]:
dd


Out[2]:
       RACE AGE.BELOW.21 CLASS
1  HISPANIC            0     A
2     ASIAN            1     A
3  HISPANIC            1     D
4 CAUCASIAN            1     B

In [6]:
with(dd,
     data.frame(model.matrix(~ RACE - 1, dd),
                AGE.BELOW.21, CLASS))


Out[6]:
  RACEASIAN RACECAUCASIAN RACEHISPANIC AGE.BELOW.21 CLASS
1         0             0            1            0     A
2         1             0            0            1     A
3         0             0            1            1     D
4         0             1            0            1     B
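A quick note on the `-1` in the formula: it removes the intercept, so `model.matrix` keeps a column for every level of RACE. With the default treatment coding, R keeps an intercept and drops the first level (ASIAN here) as the reference. A minimal base-R check, re-creating `dd` so it runs on its own:

```r
# Rebuild dd; stringsAsFactors = TRUE keeps RACE a factor on R >= 4.0 too.
dd <- read.table(text="
   RACE        AGE.BELOW.21     CLASS
   HISPANIC          0          A
   ASIAN             1          A
   HISPANIC          1          D
   CAUCASIAN         1          B",
  header=TRUE, stringsAsFactors=TRUE)

colnames(model.matrix(~ RACE - 1, dd))  # one column per level
colnames(model.matrix(~ RACE, dd))      # intercept + contrasts; ASIAN is the reference
```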

Including all levels


In [8]:
cbind(with(dd, model.matrix(~ RACE + 0)), with(dd, model.matrix(~ CLASS + 0)))


Out[8]:
  RACEASIAN RACECAUCASIAN RACEHISPANIC CLASSA CLASSB CLASSD
1         0             0            1      1      0      0
2         1             0            0      1      0      0
3         0             0            1      0      0      1
4         0             1            0      0      1      0
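Why two separate `model.matrix` calls glued with `cbind`? In a single formula without an intercept, only the first factor keeps all of its levels; later factors still get treatment contrasts, so one CLASS level would be dropped. A small sketch (re-creating `dd` so it runs standalone):

```r
dd <- read.table(text="
   RACE        AGE.BELOW.21     CLASS
   HISPANIC          0          A
   ASIAN             1          A
   HISPANIC          1          D
   CAUCASIAN         1          B",
  header=TRUE, stringsAsFactors=TRUE)

# With one formula, CLASS loses its first level (CLASSA) to contrast coding:
colnames(model.matrix(~ RACE + CLASS + 0, dd))
```

Note that the cbind-ed full encoding is collinear by construction (each block of dummies sums to one), which matters for linear models but not for trees.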

One more approach: caret's dummyVars


In [9]:
library(caret)


Loading required package: lattice
Loading required package: ggplot2
Warning message:
: package ‘ggplot2’ was built under R version 3.2.4

In [12]:
trainDummy <- dummyVars(AGE.BELOW.21 ~ ., data = dd)

In [14]:
predict(trainDummy, dd)


Out[14]:
  RACE.ASIAN RACE.CAUCASIAN RACE.HISPANIC CLASS.A CLASS.B CLASS.D
1          0              0             1       1       0       0
2          1              0             0       1       0       0
3          0              0             1       0       0       1
4          0              1             0       0       1       0
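`dummyVars` also takes a `fullRank` argument: `fullRank = TRUE` drops one level per factor, avoiding the collinearity that full one-hot encoding introduces in linear models. A sketch, assuming caret is installed:

```r
library(caret)

dd <- read.table(text="
   RACE        AGE.BELOW.21     CLASS
   HISPANIC          0          A
   ASIAN             1          A
   HISPANIC          1          D
   CAUCASIAN         1          B",
  header=TRUE, stringsAsFactors=TRUE)

enc <- dummyVars(AGE.BELOW.21 ~ ., data = dd, fullRank = TRUE)
predict(enc, dd)   # one dummy column dropped per factor: 4 columns instead of 6
```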

Exercise

  1. Do one-hot encoding on the bank marketing dataset.
  2. Run logistic regression on the encoded data.
  3. Compute accuracy metrics. Do you see any difference?
  4. Does one-hot encoding impact a decision tree? Discuss.
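The steps above can be sketched with the tiny `dd` frame as a stand-in for the bank marketing data (its real columns and outcome variable will differ; this is only the shape of the workflow, not a meaningful fit on 4 rows):

```r
dd <- read.table(text="
   RACE        AGE.BELOW.21     CLASS
   HISPANIC          0          A
   ASIAN             1          A
   HISPANIC          1          D
   CAUCASIAN         1          B",
  header=TRUE, stringsAsFactors=TRUE)

# Step 1: one-hot encode the predictors.
X <- data.frame(model.matrix(~ RACE + CLASS + 0, dd),
                AGE.BELOW.21 = dd$AGE.BELOW.21)

# Step 2: logistic regression (suppressWarnings: this tiny, perfectly
# separable data frame triggers convergence warnings).
fit <- suppressWarnings(
  glm(AGE.BELOW.21 ~ ., data = X, family = binomial))

# Step 3: training accuracy at a 0.5 threshold.
acc <- mean((predict(fit, type = "response") > 0.5) == X$AGE.BELOW.21)
```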
